Graph-based Semi-Supervised & Active Learning for Edge Flows
We present a graph-based semi-supervised learning (SSL) method for learning
edge flows defined on a graph. Specifically, given flow measurements on a
subset of edges, we want to predict the flows on the remaining edges. To this
end, we develop a computational framework that imposes certain constraints on
the overall flows, such as (approximate) flow conservation. These constraints
render our approach different from classical graph-based SSL for vertex labels,
which posits that tightly connected nodes share similar labels and leverages
the graph structure accordingly to extrapolate from a few vertex labels to the
unlabeled vertices. We derive bounds for our method's reconstruction error and
demonstrate its strong performance on synthetic and real-world flow networks
from transportation, physical infrastructure, and the Web. Furthermore, we
provide two active learning algorithms for selecting informative edges on which
to measure flow, with applications to optimal sensor deployment. The
first strategy selects edges to minimize the reconstruction error bound and
works well on flows that are approximately divergence-free. The second approach
clusters the graph and selects bottleneck edges that cross cluster boundaries,
which works well on flows with global trends.
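To make the constrained formulation concrete, below is a minimal least-squares sketch of edge-flow interpolation under approximate flow conservation. The function name, the regularization weight `lam`, and the exact objective are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def interpolate_edge_flows(B, measured_idx, measured_vals, lam=0.1):
    """Fill in unlabeled edge flows by softly enforcing flow conservation.

    B             -- (num_nodes, num_edges) signed incidence matrix
    measured_idx  -- indices of edges with observed flow
    measured_vals -- flow values on those edges
    lam           -- ridge weight on the free flows (assumed hyperparameter)

    Solves argmin_f ||B f||^2 + lam^2 ||f_free||^2 with the measured
    entries of f held fixed, so the prediction is nearly divergence-free.
    """
    n_edges = B.shape[1]
    free_idx = np.setdiff1d(np.arange(n_edges), measured_idx)

    # The divergence contributed by measured edges is a fixed offset.
    rhs = -B[:, measured_idx] @ measured_vals

    # Stack the conservation term and the ridge term into one least-squares solve.
    B_free = B[:, free_idx]
    A = np.vstack([B_free, lam * np.eye(len(free_idx))])
    b = np.concatenate([rhs, np.zeros(len(free_idx))])
    f_free, *_ = np.linalg.lstsq(A, b, rcond=None)

    f = np.zeros(n_edges)
    f[measured_idx] = measured_vals
    f[free_idx] = f_free
    return f
```

Under this reading, the first active learning strategy makes intuitive sense: when the true flow is approximately divergence-free, the reconstruction error is governed by how well the conservation constraints pin down the unmeasured edges.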
Anchored Speech Recognition with Neural Transducers
Neural transducers have achieved human-level performance on standard speech
recognition benchmarks. However, their performance significantly degrades in
the presence of cross-talk, especially when the primary speaker has a low
signal-to-noise ratio. Anchored speech recognition refers to a class of methods
that use information from an anchor segment (e.g., wake-words) to recognize
device-directed speech while ignoring interfering background speech. In this
paper, we investigate anchored speech recognition to make neural transducers
robust to background speech. We extract context information from the anchor
segment with a tiny auxiliary network, and use encoder biasing and joiner
gating to guide the transducer towards the target speech. Moreover, to improve
the robustness of context embedding extraction, we propose auxiliary training
objectives to disentangle lexical content from speaking style. We evaluate our
methods on synthetic LibriSpeech-based mixtures spanning several SNR and
overlap conditions; averaged over all conditions, they yield a 19.6% relative
word error rate improvement over a strong baseline.
Comment: To appear at IEEE ICASSP 2023
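As a rough illustration of the biasing mechanisms described above, here is a hedged PyTorch sketch of encoder biasing and joiner gating driven by an anchor-derived context embedding. All module names, dimensions, and the gating rule are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AnchorBiasedEncoder(nn.Module):
    """Encoder biasing sketch: project the anchor context embedding and
    add it to every encoder frame (hypothetical layer names and sizes)."""
    def __init__(self, enc_dim=512, ctx_dim=128):
        super().__init__()
        self.bias_proj = nn.Linear(ctx_dim, enc_dim)

    def forward(self, enc_out, ctx):
        # enc_out: (batch, time, enc_dim); ctx: (batch, ctx_dim)
        return enc_out + self.bias_proj(ctx).unsqueeze(1)

class GatedJoiner(nn.Module):
    """Joiner gating sketch: a sigmoid gate computed from the anchor
    embedding scales the combined encoder/predictor representation."""
    def __init__(self, dim=512, ctx_dim=128, vocab=1024):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(ctx_dim, dim), nn.Sigmoid())
        self.out = nn.Linear(dim, vocab)

    def forward(self, enc_t, pred_u, ctx):
        # enc_t: (batch, T, 1, dim); pred_u: (batch, 1, U, dim)
        joint = torch.tanh(enc_t + pred_u)      # (batch, T, U, dim)
        g = self.gate(ctx)[:, None, None, :]    # (batch, 1, 1, dim)
        return self.out(g * joint)              # logits over the vocabulary
```

The intent of such a gate is that representations consistent with the anchor speaker pass through largely unchanged, while mismatched background content is attenuated before the output projection.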
Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data
In this work, we extend the instruction-tuned Llama-2 model with end-to-end
general-purpose speech processing and reasoning abilities while maintaining the
wide range of LLM capabilities, without using any carefully curated paired
data. The proposed model can utilize audio prompts as a replacement for text
and sustain a conversation. Such a model also has extended cross-modal
capabilities such as being able to perform speech question answering, speech
translation, and audio summarization, amongst many other closed- and open-domain
tasks. This contrasts with prior approaches in speech, in which LLMs are extended
to handle audio only for a limited set of pre-designated tasks. Experiments show
that our end-to-end approach is on par with or outperforms a cascaded system
(speech recognizer + LLM) in terms of modeling the response to a prompt.
Furthermore, unlike a cascade, our approach can interchange text and audio
modalities and exploit the prior context of a conversation to provide better
results.
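One common way to realize such an end-to-end design, and a plausible reading of the description above, is to project audio-encoder outputs into the LLM's token-embedding space and prepend them to the text embeddings. The sketch below assumes that design; the class name, dimensions, and encoder are hypothetical.

```python
import torch
import torch.nn as nn

class SpeechLLMSketch(nn.Module):
    """Sketch: audio features are encoded, linearly projected into the
    LLM embedding space, and prepended to the text embeddings so a
    decoder-only LLM (e.g., Llama-2) consumes both modalities."""
    def __init__(self, audio_encoder, llm, audio_dim=1024, llm_dim=4096):
        super().__init__()
        self.audio_encoder = audio_encoder   # any pretrained speech encoder (assumed)
        self.proj = nn.Linear(audio_dim, llm_dim)
        self.llm = llm                       # a Hugging Face-style causal LM (assumed)

    def forward(self, audio_feats, text_ids):
        audio_emb = self.proj(self.audio_encoder(audio_feats))  # (B, T_a, llm_dim)
        text_emb = self.llm.get_input_embeddings()(text_ids)    # (B, T_t, llm_dim)
        inputs = torch.cat([audio_emb, text_emb], dim=1)
        return self.llm(inputs_embeds=inputs)
```

Because the projected audio tokens live in the same embedding space as text tokens, a single forward pass handles either modality, which is what would allow text and audio to be interchanged within a conversation.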
Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model
Neural network pruning offers an effective method for compressing a
multilingual automatic speech recognition (ASR) model with minimal performance
loss. However, it requires several rounds of pruning and re-training for each
language. In this work, we propose an adaptive masking approach for pruning a
multilingual ASR model efficiently in two scenarios, yielding either sparse
monolingual models or a sparse multilingual model (named Dynamic ASR Pathways).
Our approach dynamically
adapts the sub-network, avoiding premature decisions about a fixed sub-network
structure. We show that our approach outperforms existing pruning methods when
targeting sparse monolingual models. Further, we illustrate that Dynamic ASR
Pathways jointly discovers and trains better sub-networks (pathways) of a
single multilingual model by adapting from different sub-network
initializations, thereby reducing the need for language-specific pruning.
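To illustrate the adaptive-masking idea in isolation, here is a minimal sketch in which a magnitude-based pruning mask is recomputed from the current weights during training rather than being frozen after an initial pruning round. The thresholding rule and refresh schedule are assumptions, not the paper's recipe.

```python
import torch

def adaptive_magnitude_mask(params, sparsity):
    """Recompute a binary pruning mask from current weight magnitudes.

    params   -- dict mapping parameter names to weight tensors
    sparsity -- fraction of weights to zero out in each tensor
    """
    masks = {}
    for name, w in params.items():
        k = int(w.numel() * sparsity)  # number of weights to prune
        if k == 0:
            masks[name] = torch.ones_like(w)
            continue
        # The k-th smallest magnitude serves as the pruning threshold.
        thresh = w.abs().flatten().kthvalue(k).values
        masks[name] = (w.abs() > thresh).float()
    return masks

# Hypothetical usage: refresh the mask every few hundred training steps
# (instead of fixing the sub-network up front), then re-apply it after
# each optimizer step:
#   masks = adaptive_magnitude_mask(dict(model.named_parameters()), sparsity=0.7)
#   for name, p in model.named_parameters():
#       p.data.mul_(masks[name])
```

Refreshing the mask lets weights that regain importance re-enter the sub-network, which is the property the abstract credits with avoiding premature commitment to a fixed sub-network structure.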